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Abstract 

Multivariate Time-Series (MTS) are ubiquitous, and are generated in 
areas as disparate as sensor recordings in aerospace systems, music and 
video streams, medical monitoring, and financial systems. Domain experts 
are often interested in searching for interesting multivariate patterns from 
these MTS databases which can contain up to several gigabytes of data. 
Surprisingly, research on MTS search is very limited. Most existing work 
only supports queries with the same length of data, or queries on a fixed set 
of variables. In this paper, we propose an efficient and flexible subsequence 
search framework for massive MTS databases, that, for the first time, 
enables querying on any subset of variables with arbitrary time delays 
between them. We propose two provably correct algorithms to solve this 
problem — (1) an B*-tree Based Bearch ( RBS ) which uses Minimum 
Bounding Rectangles (MBR) to organize the subsequences, and (2) a List 
Based Bearch (LBS) algorithm which uses sorted lists for indexing. We 
demonstrate the performance of these algorithms using two large MTS 
databases from the aviation domain, each containing several millions of 
observations. Both these tests show that our algorithms have very high 
prune rates (>95%) thus needing actual disk access for only less than 5% 
of the observations. To the best of our knowledge, this is the first flexible 
MTS search algorithm capable of subsequence search on any subset of 
variables. Moreover, MTS subsequence search has never been attempted 
on datasets of the size we have used in this paper. 


1 Introduction 

Many data mining application domains generate large multivariate time series 
(MTS) databases. Examples of such domains include earth sciences, music, 
video, medical monitoring, aerospace systems, and financial systems. Domain 
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experts are often interested in searching for particular patterns — waveforms over 
subsets of variables with some delays between them. 

The motivation for this research comes from applications in any domain 
where an entity can be described as a multivariate sequence and one needs to 
search for entities having specific characteristics defined by a particular com- 
bination of some or all of those features. Suppose that an airline has a large 
database of one million flights of multivariate time series that show the settings 
of the control surfaces (usually discrete signals), the pilot inputs (discrete), as 
well as the heading, speed, and readings from the propulsion systems (all usu- 
ally continuous). In many such databases, the number of recorded parameters 
from a modern aircraft is nearly 1000. The safety analyst may want to find all 
situations in the database that correspond to a “go-around” situation in which 
a landing has been aborted and the aircraft has been directed to circle back for 
another landing. 

One can find such situations using a subset of the fields in the time series 
database where the event “Landing Gear Retracted” occurs just after altitude 
descends below 2000 feet. Another search for indicators of an “unstable ap- 
proach” may include searching on parameters such as speed, descent rate, verti- 
cal flight path, and several cockpit configuration parameters. Again, this search 
would be done on about a dozen parameters out of the 1000 parameters that 
may be recorded on the aircraft. The events would be separated in time and 
may or may not occur on a particular flight. 

Fig. 1 shows an MTS from a real aviation dataset of CarrierX 1 . Each MTS 
contains the data collected from multiple sensors of an aircraft during a flight. 
We plot only six variables for clarity. In the figure, the x-axis refers to the 
different parameters while the y-axis refers to time. Typically, queries by the 
analyst may look like: 

1. Return all flights where the altitude monotonically changes from 10000 
ft to 5000 ft, speed decreases from 300 knots to 200 knots, and landing gear is 
down. Such a combination of parameter values may be precursors to unstable 
approaches while landing. 

2. Return all small-cap stocks whose daily price drops by 10% over 3 days 
just before a strong sell-off (30% over 10 days) in at least m out of K stocks 
and then increases by at least 15% over the remaining 30 days. This could be 
a signature indicative of insider-trading in an attempt to unfairly control the 
share prices in the specific sector. 

None of the current research in MTS search [1] [2] [3] [4] support the types of 
queries described here. Current algorithms in this area require that the query 
be of the same length as that of the entire MTS and that all queries be on a 
fixed set of variables (usually all the variables) . Additionally, current algorithms 
do not allow for any time lag between the variables in the query. 

In this paper we address the following problem: given a large database of 
multivariate time series data representing entities, we wish to provide a search 
technology that allows analysts to rapidly identify entities with particular char- 

1 We cannot release the name of the carrier due to the data sharing agreement. 
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Figure 1: Sample MTS dataset and query Q. x-axis refers to different param- 
eters and y - axis refers to time. Components of query and time delays are also 
shown. 


acteristics such as the scenarios described above. We assume that the user 
supplies a query consisting of waveforms over several variables — typically sub- 
stantially fewer than the total number of variables present in the database. 
Additionally, the user may choose (at search time) how many and which vari- 
ables to query, i.e., this need not be fixed in advance (during index- building 
time). This requires tremendous flexibility of the search algorithm. Also the 
query may cover any desired length of time up to the maximum length of the 
available time series. The waveforms may have some time-shifts between them. 
The user also supplies a threshold for each variable describing the maximum 
allowable difference between the query variable and the corresponding variable 
in any matches that are returned. The MTS search algorithm must return all 
matches with no false dismissals or false positives. The specific contributions of 
this paper are as follows: 

(1) We propose two algorithms — an R *- tree based search algorithm ( RBS ), 
and a list based search algorithm (LBS) for efficient searching of massive MTS 
subsequences defined on an arbitrary subset of variables with arbitrary time 
delays. 

(2) We have demonstrated the usefulness of our algorithm by searching for this 
“go around” pattern in a real commercial aviation dataset. 

(3) To the best of our knowledge, the datasets that we have used for testing 
the performance of our algorithms are much larger than those reported in the 
literature. 

The rest of the paper is organized as follows. In Section 2, we discuss work 
related to this area of research. In Section 3, we describe the notation and give 
a precise definition of the MTS search problem. In Section 4 we describe a 
fast UTS subsequence search algorithm leading to the MTS search algorithm in 
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Section 5. We analyze the algorithms in Section 6. In Section 7 we demonstrate 
the performance of our algorithm experimentally. We provide conclusions and 
descriptions of future work in Section 8. 


2 Related work 

In general, prior research on MTS is limited. Yang and Shahabi [1] present a 
PCA-based similarity technique for comparing two MTS. Given a database of 
MTS this technique first computes the covariance matrix between two MTS. 
Then eigenvectors and eigenvalues of the covariance matrix are used as a mea- 
sure of similarity between the MTS. This work is extended in [5] in which the 
authors propose the use of kernel PCA instead of traditional PCA. Distance- 
based index structure for MTS has been discussed by Yang and Shahabi [6] . The 
work by Lee et al. [4] addresses the problem of searching in multi-dimensional 
sequences. The multi-dimensional sequence is partitioned into subsequences, 
packed into MBR and then indexed using the i?*-tree scheme. Vlachos et al. [3] 
proposes an index structure for multi-dimensional time series which can handle 
multiple distance functions such as LOSS and DTW. 

There exist a plethora of work on subsequence search for univariate datasets 
(UTS). Popular techniques for performing entire length time series search in- 
clude the ones proposed by Keogh and Ratanamahatana [7] and the references 
therein. One of the early works of subsequence matching is by Faloutsos et al. 
(FRM) [8] in which the authors have proposed a Discrete Fourier Transform 
(DFT)/i?*-tree based indexing scheme. In this algorithm, input time series is 
first broken into overlapping window sequences of fixed length and then 6 DFT 
coefficients are extracted from each sequence. These 6-dimensional representa- 
tions are then packed into a minimum bounding rectangle (MBR) and indexed 
using an i?*-tree data structure. On receiving a query, the same process is ap- 
plied (extracting DFT coefficients) and then searched in the i?*-tree. Candidate 
MBRs are then checked with the actual database to remove false alarms. We 
compare this algorithm with our algorithms in the experimental section. A dual 
approach to this one, proposed by Moon et al. [9], is to decompose the input 
time series into disjoint sequences and the query sequence into sliding windows. 
However, as the size of the time series increases to millions of points, storing 
all the points in the index may become challenging. To alleviate this prob- 
lem, Traina et al. [10] recently proposed a technique of using multiple reference 
points to speed up the search. Our algorithm is different than theirs in the 
following sense: (1) [10] only talks about range queries whereas we can perform 
arbitrary subsequence matching and nearest neighbor search, and (2) unlike [10] 
which only works for univariate time series, we can perform multivariate sub- 
sequence search on an arbitrary number of variables and arbitrary time delays 
among those variables. Several other techniques exist for subsequence matching 
[ 11 ][ 12 ], 

At this point, we would like to mention that none of the existing algorithms 
for multivariate search is applicable in our problem setting. This is primarily 
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because most of them require that all the variables be used for the query. In 
our problem, we query over an arbitrary subset of variables and thus, to apply 
the existing algorithms, we need build and store a separate index for all possible 
combinations of input features. For example, the real CarrierX dataset that 
we have used in our experiments has 16 variables, and therefore to allow any 
subset of variables in the query, we need to build and store 16! « 2.1 x 10 13 (z.e. 
roughly 21 trillion) indices which is impractical for storage and computational 
reasons. This motivates us to provide a different solution to this problem which 
alleviates these issues by building a much smaller number of indices (linear in 
the number of features). 


3 Background 

In this section we define the notations that we have used in the rest of this paper 
and also present a formal problem definition. 

3.1 Notations 

First, we define a UTS database. A UTS database U-DB consists of \D\ UTS. 
For ease of explanation, we assume that each UTS is stored in a separate file; 
multiple UTS can also be stored in the same file in other applications. The z-th 
file stores a time series y d) = {y[ l \ y%\ ■ ■ ■ }, where each £ K or {0, 1}. The 
superscript refers to the file id while the subscript refers to the sample point in 
that file. Let yd) and yU) be two UTS sequences in two different files of U-DB. 
Then, (1) L (yd)) denotes the length (number of points) of yd), (2) y^-b] denotes 
the subsequence that includes entries in positions a through b for UTS in the 
z-th file, and (3) dist(y^i a:b \ y^a : b]) denotes the Euclidean distance between 
two univariate subsequences. 

It is natural to extend this definition to a multivariate database M_DB in 
which each file contains a set of vectors. Let d be the number of features or 
attributes across all the files in M_DB. Denoting vectors of dimension d in bold, 
we can similarly write the MTS stored in the z-th file as yd) = (yd' , yd-* ,...}, 
where or {0, l} d . Let w denote the size of a sliding window containing 

w consecutive samples of a UTS. 

3.2 Problem definition 

We first define e nearest neighbors e-NN of UTS. 

Definition 1 (e-NN UTS search) Given a user defined threshold e, U-DB, 
and a UTS subsequence Q of length w, (which we call the query), UTS e-NN 
returns all the subsequences Si of length w from U-DB, such that , dist(Si , Q ) < 

e. 

Our next definition deals with multivariate query Q. 
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Definition 2 (Multivariate Query Q ) A multivariate query Q consists of 
the following components: 

• any (sub) set of variables Q.var C {1, . . . , d} 

• a set of UTS subsequences {Q.seqi} for each such variable in Q.var, and 

• time delays 61 , 62 , - ■ ■ between the sequences in Q.var 

We are now in a position to define e-NN for MTS search. 

Definition 3 (e-NN MTS search) Given M_DB, a multivariate query Q, 
and user defined thresholds e = {ei,C 2 ,...} for each variable in Q, MTS e- 
NN returns a table {MTS A, Begin-of f set\, Beginjof f se< 2 , ■ ■ ■ , } such that (1) 
UTS e-NN is satisfied by every feature in Q, (2) the subsequences are found in 
the same MTS file, and (3) the Begin-off set’s are delayed by 61 , 62 , ■ ■ ■ in which 
Begin-of fsetj denotes the starting time point for Q.seqj. 


4 Fast UTS subsequence search 

When a query Q defined in Section 3.2 contains only one variable, it becomes 
a univariate time series search. For clarity and ease of exposition, we will start 
with solving this problem. We assume there is a minimal length for all queries 
and it is set to w. Smaller choice of w provides better granularity of search 
while increasing both the indexing and the search time. We first discuss the 
RBS algorithm in detail and then discuss the salient differences with our LBS 
algorithm. 

4.1 Overview of algorithm 

For a univariate query Q on the u-th variable, the brute-force method to find all 
its e-NN is to compare it with all subsequences of length L{Q) for every offset 
of time series j/M (Vi = 1, 2, . . . , |Z?|), which is time consuming and impractical. 

A classic data mining solution to speed up this process is to find a lower 
bound of the distance measure and use this bound to prune irrelevant candi- 
dates. This lower bound should be: (1) computationally more efficient than 
computing the distances between all subsequences, and (2) tight (very close) 
with respect to the original distance, so that we can prune sufficiently. 

One such technique for deriving a lower bound, also used in the literature 
[10] [13], is using a reference subsequence based on the triangle inequality. Fig. 2 
illustrates the basic idea of pruning. First, we randomly pick a subsequence R (of 
the same length w), and calculate its distance to all the remaining subsequences. 
Then, we order them by their distance to R. Only Si and S 2 are shown for 
clarity in the figure. Note that these two steps are done before the query Q 
arrives and only need to be done once. When a query Q is applied, we calculate 
the distance dist(Q,R). All candidates whose distances are not in the range 
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[dist(Q, R ) — e, dist(Q, R) + e] ( e.g . S 2 in Fig. 2) can be pruned. This is due to 
the triangle inequality: 

dist(Q , S 2 ) > | dist(Q, R) — dist(S 2 , -R)| > e. 

Finally, for all candidates in this range (e.g. Si in Fig. 2), we do an exact 
calculation to remove the false positives. In order to reduce the number of such 
false positives, we use multiple reference points to build several indices and 
then join the candidates from these indices to get the final set of candidates. 
We discuss this in detail in the next section. 

- € dist(Q, R ) + e 


R Si Q S 2 

Figure 2: Candidate subsequences (Si, S 2 ) ordered by their distance to a refer- 
ence subsequence R. When a query Q is applied, a range based on dist(Q,R) 
can be used to prune candidates. 


4.2 RBS algorithm details 

i?*-tree based algorithm (RBS) uses the concept of spatial indexing to store 
and retrieve time series subsequences. In order to make this indexing more 
efficient, we devise a novel technique of incorporating the triangular inequality 
directly into this i?*-tree scheme. We can control the amount of pruning and 
the corresponding search time by using multiple reference points against which 
the triangular inequality is applied. To the best of our knowledge, using spatial 
indexing along with multiple global reference points for time series subsequence 
search has never been explored before. 

We first discuss the index building algorithm followed by the search algo- 
rithm. Alg. 1 presents the pseudo-code of RBS build index. The inputs are 
U-DB and length of the sliding window w. The output is a set of spatial 
indices Index 1 , ..., Index r . In the first step, we select r subsequences ran- 
domly Ri , . . . , R r of size w from U-DB which we call reference points. Then, 
for each subsequence S of length w from the i-th UTS (j/W) in UJDB , we 
find the Euclidean distance of S from the fc-th reference point R k- Therefore, 
each subsequence of length w gets mapped to a 1-D point (its distance to Rk ) • 
Next, we arrange several such 1-D points into a minimum bounding rectan- 
gle or MBR as follows. Each entry of the MBR consists of the utsJd, min , 
max , BeginJOf fset, End-Offset, where min and max are the minimum and 
maximum values (here distances to Rk) of all points included in that MBR. 
BeginJOf fset and End JO f f set are the beginning and end time points of all 
the elements in this MBR. For any UTS, the first point included in the MBR is 
trivially {utsji, Dist, Dist , 1,1}, where Dist is the distance of the first sequence 
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to Rk ■ For all other subsequences, we first compute Dist, and then check if 
adding this point to the existing MBR will increase its marginal cost , a heuris- 
tic proposed by Faloutsos et al. [8] . Due to shortage of space we do not describe 
it here. If the new marginal cost (after adding the new point) is greater than the 
old cost (without the point), a new MBR is started with this new point as the 
sole entry, else the old MBR is updated. The CheckMC routine in the pseudo 
code performs this task. Once all the subsequences of utsji are processed, all the 
MBR’s are appended to file mbrk and the next UTS is processed. Finally, each 
of these mbr k files are indexed using an RTreeBuild routine and the spatial 
indices are saved on disk. 

We would like to point out that while Faloutsos et al. [8] also use MBR to 
combine subsequences to reduce the index space, they map each subsequence 
into 6 DFT coefficients while we map each subsequence into a single value viz. 
distance to the reference point. So in our case, each MBR is a two dimensional 
point, leading to better scalability. 


Algorithm 1: Build Index for RBS 
Input: UJJB , w 

Output: Indices Index i, . . . , Index r 
Initialization: Select r reference points R ±, . . . , R r ; 

begin 

for k = 1 to r do 

for uts-i in UTS-Database do 
nMBR <— 1; 

Dist <- dist[R k ,y^. s -\)', 

mbr[nMBR ) {utsJ,, Dist , Dist , 1, 1}; 

M axO f f set t— ( L[utsJ ) — w + 1); 
for j = 2 to MaxOffset do 
Dist <r- dist[R k , y'y.j+ w _ 1 ]); 

[ud,newMBR] e- CheckMC(TO&r, Dist ); 
if ud == 0 then nMBR = nMBR + 1; 
mbr [nMBR) newMBR ; 

Append mbr to file mbr 

IndeXk t— RTreeBuild(m&rfe); 

Save IndeXk, Rk', 


When a query Q of length w is provided, we use the search code shown in 
Alg. 2. The inputs in this case are the UTS query Q , U-DB, the set of indices, 
the set of reference points, w, and e. The output is e-NN of Q. First, for each 
reference point Rk, we find the distance Dk of the query from it. Then we 
perform a range query search {Dk — e, Dk + e} using the RTreeSearch routine. 
We call this step the first level of pruning. The output of the search code are 
a set of candidate MBR’s which intersect the query MBR. In the second level 
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of pruning, we intersect the candidate MBRs found using different reference 
points. This reduces the number of false alarms dramatically as we show in our 
experiments, leading to very high prune rate and very low search time. Once a 
compact candidate set is found, we do disk access to retrieve those candidates 
and remove false alarms. 


Algorithm 2: RBS e-NN Search on UTS 
Input: U-DB , Q, Index i, . . . , Inde x r , R \, . . . , R r , w, e 
Output: e-NN of Q 
begin 

e-NN <- 0; 

for k = 1 to r do 

D k = dist(Q.seqi,Rk ); 

Candk = RTreeSearch(/?rdea:fc, {D k — e,D k + e}); 

CandAll c— {Plfc = i Candk}', 
forall the {utsJ, b , e} £ CandAll do 
Fetch j/ !•“*?' from utsJ file on disk; 

Dist = dist(y^- l) ,Q.seqi)\ 

if Dist < e then e-NN <— e-NN (J {utsJ, b, e}; 


We now discuss how RBS handles queries longer than w in the following 
two cases: 

• L(Q) = nw ( n > 1): We first divide Q into n disjoint subsequences of 
length w, and search the indices set for each of them with the threshold 
e/ \J~(n). Finally, we do an exact calculation of full length candidates (over 
all n parts) to remove false alarms. The correctness of this approach relies 
on the following Theorem [8]. 

Theorem 1 If dist(Q , S) < e, then for at least one pair of disjoint se- 
quences Qi and Si of length w, we have dist(Qi, Sf) < e/^/Jn). 

• L(Q ) = nw + v (0 < v < w): We can ignore the last subsequence of 
length v and perform search on the nw disjoint subsequences as described 
before. We only consider the last subsequence when we perform the exact 
calculation. 

4.3 LBS algorithm details 

In RBS, the smallest unit of search is an MBR. Now, for one reference point, 
RBS has a prune rate directly proportional to the number of MBR’s searched 
times the number of points in that MBR. Although the search time for RBS 
can be very low, large sizes of candidate set increase the overall search time to 
fetch all the potential candidates from the disk. To alleviate this problem, we 
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present another novel algorithm LBS, in which the search unit is a subsequence 
in the input space. This algorithm directly exploits the triangular inequality 
to effectively prune bad candidates by choosing a random subsequence as a 
reference subsequence. Moreover, to increase the prune rate further, we have 
used multiple reference points. 

As before, the inputs to LBS are U JDB and length of the sliding window 
w. The output is a set of sorted lists as indices. In the first step, similar to , 
we compute the distances of all the subsequences from a few reference points 
R\, . . . , R r . We store these distances (as the key) along with the offset and 
UTS — id into a list called Indexk , for reference point Rk- In the next step we 
simply sort these k lists and store them along with the reference points. 

During searching, when a query Q of length w is provided, for each reference 
point Rk, we find the distance Distk of the query from Rk- Then we collect 
those candidates from Indexk whose key (distance) lies in the range Distk ± e. 
This is a direct application of the triangle inequality. As before, we intersect 
the candidate sets for all the reference points finally do a disk access to remove 
false alarms. We do not present the pseudo-code here due to shortage of space. 

5 Flexible MTS subsequence search 

We now describe our algorithm for MTS query search. In our problem setting, 
we have substantially more variables to index compared to the number of vari- 
ables given in a typical query. Moreover, the query variables are not known 
apriori which severely restricts the use of existing MTS search algorithms. The 
algorithm we propose here has excellent performance for the multivariate queries 
that we want to execute. 

As before, we split the discussion into two parts. The index building algo- 
rithm is very similar to the one presented for UTS search. Alg. 3 presents the 
pseudo code. The first step is to decompose the MTS database M.DB into 
a series of univariate time series databases U-DB W, . . . UJDB^ d \ one for each 
feature in the MTS. Then we select r reference points for each UTS indepen- 
dently, and use Alg. 1 to build indices for each of the d UTS’s. Thus for d 
features, we will have d x r number of sorted lists for LBS algorithm and d x r 
number of i?*-trees for RBS. We store these indices along with the reference 
points on disk. 

Given a search query Q having v sequences for v variables and v — 1 time 
delays between them, the goal of MTS search algorithm (Alg. 4) is to return all 
matching multivariate patterns from M_DB. To solve this, we first take the first 
variable (call it Q.var{ 1)) of Q and do a search on the index corresponding to 
feature Q.var( 1). The FindCandidates function in Alg. 4 performs this search 
by first finding a candidate set from each index file of Q.var(l) and then joining 
them over multiple reference points. This routine is similar to Alg. 2 (except the 
disk access part). This generates an MTS table as: {MTSSd, Begiri-of fseti}. 
Similarly, the next variable Q.var(2) is searched on the relevant index. These 
two searches on the indices correspond to the first level of pruning. At this 
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Algorithm 3: MTS Build Index using RBS 
Input: M_DB , w 
Output: Index for MTS search 

begin 

Convert M_DB into UJDB^, ..., U.DB (<*); 

for / = 1 to d do // each feature 

Select R.[ f \ ..., R { , f) for UJdB^- 
Index each UJDB^L using Alg. 1. 


point we prune the candidates further by joining these candidate sets ( Cand 12 ) 
and noting that (1) all candidates in candidate 1 and candidate 2 must have 
the same MTSJd , and (2) the begin offsets between any two candidates from 
the two sets must be delayed by an amount 6 1 . The JoinCandidate routine 
performs this join. By this second level of pruning, we add another column 
to the tabic for the second variable {MTSJ,d,Begin-offseti,Begin-offset 2 }- 
Note that until this point, we have not performed any actual disk access, and 
searched only on the indices. We could continue joining the candidate sets and 
create a compact set for all the variables in Q. However, in our experiments (not 
reported here), we notice that the size of the candidate set after the first two 
joins is very small and does not reduce further on joining other candidate sets. 
We validated this for several variables in the candidate sets; in most cases, the 
size of the candidate set was less than 5% of the total number of subsequences. 
Thus, heuristically it becomes redundant to search for the remaining variables in 
the index. Instead, we do a disk access to retrieve all candidates from Cand \2 to 
remove the false alarms. The resulting subsequences Cand are the true nearest 
neighbors of Q considering the first two variables. We continue to search the 
remaining variables Q.var { 3 : v) by retrieving them directly from the disk after 
noting that they must come from the same MTS and satisfy the specified time 
delays. 

6 Analysis of algorithms 

In this section analyze the properties of the algorithms. 

6.1 Correctness of LBS and RBS 

Theorem 2 Both LBS and RBS algorithms are correct i.e. they guarantee no 
false dismissals. 

Proof 1 The proof is based on the triangle inequality. For a reference point 
R, query Q and any arbitrary subsequence S , we can write by virtue of triangle 
inequality: 

\dist(Q , R) — dist(S, i?) | < dist(Q 7 S ). 
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Algorithm 4: MTS e-NN Search using RBS 
Input: M_DB , Q, Index , R\, . . . ,R r , w, e 
Output: e-NN of Q 
begin 

e-NN «- 0; 

for i = 1 to Q.var do // each, feature 

Candi t— FindCandidates(Q. var(i))\ 

Cand <— P) 2u° r_1 JoinCandidates^andj , Candi+i, Si) for 
c £ Cand do // remove false positives 

Fetch c from *-th MTS Disti = dist(c.seq\ , Q.seqi ); 

Disti = dist(c.seq 2 ,Q.seq 2 )', 


if Dist\ < £i and Dist 2 < e 2 and . . . then 

L e-NN ^ e-NN 


Aom; for any query Q which belongs to e-NN of S , dist(Q , S') < e. Combining, 
we get 

| dist(Q, R) — dist(S, f?)| < dist(Q , S) < e 

i.e. dist(Q,S) < e =>- |dzst(Q, f?) — dist(S, f?)| < e =>• dist(Q,R) — e < 
dist(S,R) < dist(Q,R) + e. Since in both LBS and RBS, we retrieve all 
sequences from the index in the range dist(Q, R) ± e, both these algorithms 
guarantee no false dismissals. 

6.2 Storage complexity of LBS' and BBS 

For LBS, we need to insert every subsequence in the sorted list for every UTS. 
Let Ti be the length (number of time points) of any MTS in the i-th file. The 
number of subsequences for the i-th MTS is, therefore, Tj — w + 1. Given there 
are d variables in each of the MTS files, the number of subsequences to process 
for the i-th MTS file is d(T) — w + 1). For \D\ total MTS files, we get the 
total number of subsequences as, d (B — u; + 1). For r reference points, 
the overall storage complexity is 0(rd — w + 1)) = 0(rdJ2 i=i B)- For 
RBS, the index storage complexity is 0(rdYf}^}\ Mf), where Mi are the number 
of MBR’s created from the i-th MTS. Since in general, M,; <C B , RBS has a 
much lower index storage complexity. 

6.3 Running time of LBS and RBS 

For LBS, the index building time is proportional to the number of distances 
computed for each subsequence: ic(B — w + 1). For d variables, r reference 
points and |Z?| MTS files, the overall running time for inserting all the elements 
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in the index is 0(wrdJ2[=i(J'i ~ w + 1)) = 0{wrd^R [=[ Ti)- Moreover, since 
r^Si=iCZi — w + 1) elements need to be sorted, the overall running time is 
the maximum of the sorting time and the insertion time. For RBS , we need 
to do some extra computation for checking the marginal cost of each point. 
Let the time required for it be A. Therefore, the overall time complexity is, 
0((w + ^)rdJ2\=[(Ti — w + 1)), where we have ignored the time to insert Mi 
MBRs in the i?*-tree. 

The query time for both the algorithms is bounded by: 0(max,; \Candi\) + 
0[w\Cand\ ), where the max is taken over all the candidate sets and the second 
term reflects the time for actual disk access and exact computation. 

6.4 Choice of reference points 

The choice of the reference points is crucial to the performance of our algorithms. 
From Th. 2, a point S is not a potential candidate to be the nearest neighbor 
of Q if | dist(Q, R) — dist{S 1 i?)| > e, where R is an arbitrarily chosen reference 
point. This is because, by triangular inequality, dist(Q , S ) > \dist(Q, R) — 
dist(S 1 R) | > e too. Therefore, such an S cannot belong to the set of nearest 
neighbors of Q. If, on the other hand, \dist(Q,R) — dist(S,R)\ < e, then we 
cannot prune S since dist(Q, S) can be greater or less than e. Therefore, the 
goodness of R can be evaluated based on the size of the following set: S = 
{5 1 : | dist(Q,R) — dist(S,R)\ < e}. Minimizing the size of S gives a good R. 
However, in the above formulation, Q is typically unknown until query time, 
making the optimization problem unsolvablc. Our heuristic is to choose multiple 
reference points randomly from the database with the hope that each such point 
will prune many candidates and we can only work with the intersection of these 
sets. Our extensive experimental results show the effectiveness of this simple 
heuristic by choosing 3-5 reference points. 


7 Experiments 

To validate the performance of the LBS and RBS algorithms, we have run a 
variety of tests using both univariate and multivariate datasets. All algorithms 
have been implemented in Matlab and run on a 64-bit 2.33 GHz quad core dell 
precision 690 desktop running Red Hat Enterprise Linux version 5.4 having 2GB 
of physical memory. We have measured the following quantities: 

• p - the prune rate (=1 — |C|/T), where C and T are sizes of the candidate 
set and the number of sliding windows 

• t - running time 
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7.1 Univariate dataset experiments 

7.1.1 Dataset description and experimental setup 

We have used 2 univariate datasets for testing our algorithms which have been 
used in the literature [8] [9] for UTS subsequence search. The first dataset is 
a random walk dataset generated synthetically (500,000 points). The second 
dataset is a stock market dataset having 329,112 entries. We have tested 3 
algorithms on these datasets: (1) the FRM algorithm using the adaptive MBR 
approach [8], (2) LBS , and (3) RBS. 

We have measured p and t at varying window sizes w (128, 256, 512, 1024) 
and the number of reference points (1~5). The default values of these param- 
eters are fixed at 512 and 3 respectively. For each choice of w and t, we have 
experimented with five different e. The choice of each e is such that the selectiv- 
ity ( i.e . actual number of nearest neighbors/T) ranges between 10 -6 ~ 10 -1 [8]. 
p and t at each measurement point is an average over ten randomly generated 
queries. We present the results in the next section. 

7.1.2 Results 

We summarize the results of FRM , LBS and RBS in Figures 3-6. Fig. 3 
shows the average and standard deviation of p and t for each e, over ten queries 
for the random walk dataset for different values of w. For most of the thresholds, 
we see that the prune rate of LBS is the highest. Also, the prune rates of RBS 
tend to be very close to the FRM algorithm for smaller number of reference 
points. One significant advantage of both LBS and RBS over FRM is that the 
prune rates for the former two algorithms can easily be controlled by increasing 
the number of reference points; however this increases the running time as well. 
Also, the prune rates for all these algorithms increase with increasing w, due to 
lesser number of windows to index. Fig. 4 demonstrates the performance of the 
algorithms for varying number of reference points. As expected, the prune rate 
increases with increasing number of reference points. We have similar results 
for the random walk dataset shown in the Figures 5 and 6. In this case, RBS 
has a higher prune rate compared to LBS or FRM. 

To sum up, both the LBS and the RBS algorithms offer an excellent prune 
rate for UTS search. LBS offers the best prune rate of all the 3 algorithms 
compared here, but as discussed before, suffers from large storage cost. On the 
other hand, RBS uses MBRs to group similar points and hence can reduce the 
storage cost dramatically. In many cases, this reduces the search time as well. 
However, since the unit of search is an MBR (containing several points) and not 
individual points (as in LBS), the prune rate of RBS is lower than LBS. It 
also needs to be mentioned that if the variables are not normalized, the MBR 
creation heuristic (/-adaptive in [8]) decides on the density of each MBR based 
on e. Too high a value of e packs more points per MBR, reducing the number of 
MBRs. This, in turn, reduces the prune rate. Lower values of e fragments the 
MBRs to only a few points in each. This increases the prune rate but increases 
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Figure 3: Variation of p and t (mean and std dev) for different w, averaged over 
ten queries for random walk dataset. Left column shows e vs. p and right column 
shows t vs. p for w = 128, 256, 1024 from top to bottom respectively. In most 
cases, LBS shows higher prune rate while prune rates of RBS are comparable 
to FRM. Also the running time of all the algorithms are comparable; in most 
cases, LBS has the least search time. 
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Figure 4: Variation of p and t (both mean and std dev) with the number of 
reference points, averaged over ten queries for random walk dataset. Left column 
shows e vs. p and right column shows t vs. p for |r| = 1, 2, 3 from top to bottom 
respectively. In most cases, LBS shows higher prune rate while prune rates of 
RBS are comparable to FRM. Also the running time of all the algorithms are 
comparable; in most cases, LBS has the least search time. 
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Figure 5: Variation of p and t (both mean and std dev) for different w, averaged 
over ten queries for stock market dataset. Left column shows e vs. p and right 
column shows t vs. p for w = 128, 256, 1024 from top to bottom respectively. 
For this dataset, RBS shows higher prune rate than FRM or LBS. Also the 
running time of all the algorithms are comparable; in most cases, LBS has the 
least search time. 


the index search time. We test with different values of e during building indices 
and always choose an e in the middle range of those reported here. 

7.2 Multivariate dataset experiments 

7.2.1 Dataset description 

We have used two large multivariate datasets for demonstrating the search ca- 
pabilities of LBS and RBS in the multivariate domain. To the best of our 
knowledge, these multivariate datasets are much larger than the datasets used 
in the literature for multi-dimensional time series search. The datasets are de- 
scribed next. 

C-MAPSS dataset: The first dataset is simulated commercial aircraft engine 
data. The dataset contains 6,875 (=1.01) full flight recordings sampled at 1 Hz 
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Figure 6: Variation of p and t (both mean and std dev) with the number of 
reference points, averaged over ten queries for stock market dataset. Left column 
shows e vs. p and right column shows t vs. p for |r| = 1, 2, 3 from top to bottom 
respectively. In most cases, LBS shows higher prune rate while prune rates of 
RBS are comparable to FRM. Also the running time of all the algorithms are 
comparable; in most cases, LBS has the least search time. 
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with 29 engine and flight condition parameters. This dataset has 32,640,967 
tuples. We have tested our algorithm 16 variables only. 

US Regional carrier dataset (CarrierX): The second dataset is a real life 
commercial aviation dataset of a US regional carrier consisting of 3,573 (=|Z?|) 
flights. Each flight contains 46 variables. Domain experts identified a subset of 
9 variables which are important. There are 22,207,852 tuples. 

For all the multivariate experiments, we have used w = L(Q) = 256 and 3 
reference points for both LBS and RBS. 

7.2.2 Results 

We have tested 5 randomly chosen queries, each with three different thresholds. 
For each query and threshold combination, the selectivities of each ranges from 
10 -7 ~ 10 -6 . We do not present the thresholds for each variable here due to 
shortage of space. 

The performance results of LBS and RBS on CMAPSS and CarrierX are 
presented in Table 1. The second column refers to the five different queries 
we have run along with the variables for each query. We have run each query 
with three different thresholds in increasing order (hence three rows for each 
query). The next three columns show the number of candidates generated for 
the first variable ( Cand\ ), the second variable (Canc^), and after joining these 
two candidate sets Cand \2 both for LBS and RBS. Column C exact is the 
actual number of these candidates which are found to be less than the threshold 
after doing the exact calculation. The smaller the size of Cand\ 2 , the fewer the 
number of actual disk access necessary. e-NN column refers to the actual number 
of nearest neighbors of the query after taking all the variables and time delays 
into consideration. The last two columns show the prune rate p = Candn/T 
and the query time for LBS. Since the query times for RBS are very similar, we 
do not report them here. For this experimental setup, the index building time 
for LBS and RBS on the CarrierX dataset are 7 hrs and 9 hrs respectively. 

These results show that for the two large multivariate datasets, for different 
queries and thresholds, the prune rates are very high (~ 95%). Also, we notice 
that the sizes of the candidate sets are smaller for LBS than RBS for all the 
queries thereby generating fewer false positives. However, the storage require- 
ment of LBS is non-trivial. For example, for CarrierX, we need to index ap- 
proximately 22 million distances using each reference point per UTS. The total 
storage requirement for the index will be (22,000,000 x (4+4+4) /( 1024 x 1024)) « 
250 MBytes, for each UTS, assuming we store {Dist, MTSJd, Begiri-of fset} 
for each window sequence as a float of (4+4+4) bytes. For RBS , let’s assume 
that (1) we have M MBRs on average for each reference point, and (2) we store 
{minJM BR,max-M BR, MTS jid, Begiri-of fset, End-of fset} for each MBR. 

In our experiments we have M = 5, 174, 619. Then the total storage require- 
ments (assuming 4 bytes for each) will be (5,174,619x(4+4+4+4+4)/(1024xl024)) 
98 MBytes, lower than that of LBS. Also note that the query time for most 
of the queries are extremely small considering the large sizes of the datasets. 

From these results we conclude that: (1) query execution time of LBS is 
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Figure 7: Typical pattern for “go around” in CarrierX dataset. Left plot shows 
airspeed (knots) vs time while right plot shows altitude (feet) vs. time. 


expected to be much lower than RBS due to higher prune rate, (2) RBS has 
relatively higher rate of false positives compared to LBS , and (3) the index 
storage requirements of LBS may be significantly higher compared to RBS. 
However, the choice of RBS vs. LBS is application dependent. 

7.3 Application: finding anomalous flights 

With the help of a retired commercial pilot, we have used the MTS search algo- 
rithm to find flight landing patterns which result in go around/aborted landing. 
In many cases, an aircraft on approach to landing needs to abort the landing, 
climb back on full throttle and try the landing again. This can happen due to 
improper landing configuration ( e.g . high energy approach, runway incursion, 
poor visibility). Currently, most safety analysts study these events based on 
only one variable at a time which generates a large number of false positives. 
These so-called exceedences or anomalies can be indicators of safety issues. The 
frequency of such events are tracked as a measure of safety of operations. These 
events can aid significantly in understanding the underlying causal factors. 

We have searched for such incidents in the CarrierX dataset using two 
variables: airspeed (in knots) and altitude (in feet). A domain expert (a retired 
commercial pilot) has helped us sketch a typical go around pattern as shown 
in Fig. 7. The left figure shows the variation in airspeed while the right one 
shows the variation in altitude. Using such a query as the input and thresholds 
100, 4000 for the two variables, we have searched the CarrierX dataset. The 
algorithm returned 10 hits. Fig. 8 shows 4 such flight profiles. We have plotted 
the altitude and airspeed on the same graph with the left axis as the airspeed 
and the right axis as the altitude. A visual inspection of each of these flights 
demonstrates the usefulness of the algorithm in finding all the “go around” 
patterns (no false positives). The highlighted portion shows the matched time 
series for each of these plots which shows that the algorithm is accurate at 
finding similar, not exact, motifs, i.e., it has good noise tolerance. The average 
time taken for running the query is approx. 12 secs. 
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Table 1: Results of LBS and RBS CMAPSS and CarrierX dataset for five different queries and three different thresholds per 
query. For both LBS and RBS , the prune rates are always greater than 0.95, signifying that less than 5% of the candidates 
need to be retrieved from the MTS database for exact calculations. 



\Candi | 

\Cand2\ 

\Cand\ 2 1 

n 


Prune rate p 



LBS 

RBS 

LBS 

RBS 

LBS 

RBS 



LBS 

RBS 


CMAPSS 


18409 

3007594 

738 

2477549 

52 

801400 

6 

6 

0.9999 

0.9741 

2.63 

1: (25, 27, 4) 

81409 

3263815 

7567 

2565309 

2668 

1003839 

17 

10 

0.9999 

0.9675 

102.91 


251981 

3841664 

81330 

2702600 

23694 

1454776 

540 

297 

0.9992 

0.9529 

291.8 


53585 

870835 

14969 

2390063 

1411 

266022 

252 

6 

0.9999 

0.9914 

6.91 

2: (20, 29, 5) 

179850 

1295644 

50502 

2454707 

13862 

481096 

1187 

17 

0.9995 

0.9844 

130.91 


317793 

1587719 

141444 

2633060 

58905 

633137 

20124 

259 

0.9981 

0.9795 

710.12 


528470 

4753958 

14725 

306706 

6171 

290593 

453 

8 

0.9998 

0.9906 

201.13 

3: (5, 15, 28) 
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4861533 

87236 

425813 

63690 

399972 

16289 
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0.9979 

0.9871 

770.18 
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177992 
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174391 
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79332 

1445 
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5 

3 

0.9999 
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23.1 
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194616 
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0.9997 
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41.1 
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4634240 

38648 
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22034 

6471 

0.9987 

0.9803 

99.13 
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4010042 

74609 

878140 

12945 

114419 

18 

9 

0.9996 

0.9963 

141.98 

5: (5, 23, 2) 

316085 

4101886 

164881 

1160134 

49908 

203004 

332 

49 

0.9983 

0.9934 

121.9 


771259 

4356479 

337201 

1521911 

150020 

375037 

4925 

479 

0.9951 

0.9879 

821.1 

CarrierX 
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96 

10226 

3 

3 

0.9999 
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3.69 

1: (29, 23, 28) 
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15 

15 
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27 

27 
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3 
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Figure 8: Examples of “go arounds” detected by our multi- variate search algo- 
rithm on CarrierX dataset. The matching regions are highlighted. 

8 Conclusion 

In this paper we present two algorithms LBS and RBS for finding multivariate 
subsequences from large MTS datasets which allows querying on any subset of 
variables with time delays between them. Both these algorithms guarantee no 
false dismissals. RBS algorithm is novel in the sense that it organizes subse- 
quences into MBRs and uses multiple reference points to reduce false positives. 
To the best of our knowledge, using spatial indexing along with multiple global 
reference points for time series subsequence search has never been explored be- 
fore. Experiments on two massive commercial aviation related MTS datasets 
each having several millions of tuples show that both these algorithms offer ex- 
cellent prune rates (greater than 0.95). The CMAPSS and CarrierX datasets 
that we have tested are much bigger than any of the MTS datasets used in the 
literature for multivariate subsequence search. As an application of the pro- 
posed method, we have shown how it can be used for finding a critical safety 
pattern from real aviation dataset, that of aborted landings. For future work, 
we plan to implement this algorithm on Map-Reduce for better scalability. 
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A Appendix 

Currently we are in the process of opensourcing the source code of this paper for 
massive search on multivariate datasets. In this section we briefly state which 
programs to run and the input/output of each code. 

Any user needs to run two programs to execute the search technique: 
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Build Index: Program Name: BatchBuildlndex 

• Input (in this order): 

— Full path of MTS folder 
— Full path of UTS folder 
— Full path of Index folder 
— Size of window 
— Features to use 
— Full path of Reference folder 
— Number of reference points 
— Size of index unit (default value: le+05) 

— Number of mts files 

• Output: 

— UTS in uts folder 
— Index files saved in index folder 

Search: File Name: MTS-Search 

• Input (in this order): 

— Full path of UTS folder 
— Full path of Index folder 
— Size of window 
— Full path of Query 
— Thresholds as a MATLAB vector 
— Full path of Reference folder 
— Number of reference points 

Output: 

• Number of nearest neighbors as an array with one neighbor per row 
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